Faculty of Science, Technology, Engineering and Mathematics 
M249 Practical modern statistics 


The Open University




TMA 03 2017J


Cut-off date 14 March 2018 


You can submit each TMA either by post or electronically using the online 
TMA/EMA service. Please read the guidance under the ‘Assessment’ tab on 
the M249 website. 


Each TMA is marked out of 100. The marks allocated to each part of each 
question are indicated in brackets in the margin. Your overall score for each 
TMA will be the sum of your marks for all questions in that TMA. 


Copyright © 2017 The Open University WEB 05690 5 
1.1 


TMA 03 Cut-off date 14 March 2018 


Questions 1 to 3 below, on Book 3 Multivariate analysis, form tutor-marked 
assignment M249 03. Question 1 (on Part I of Book 3) is marked out of 28, 
Question 2 (on Part II) is marked out of 33, and Question 3 (on Part III) is 
marked out of 39. 


Question 1 — 28 marks 


This question is intended to assess your understanding of the use and
interpretation of graphical and numerical summaries of multivariate data.
You should be able to answer this question after working through Part I of
Book 3.


In this question you will be required to supply SPSS output for parts (a)(i),
(a)(iii), (a)(iv) and (b)(iii) only, though you will be expected to use SPSS to
answer the rest of the question. In parts (a)(i), (a)(iii), (a)(iv) and (b)(iii),
you will need to edit default SPSS plots, and you should include only the 
edited plots in your work. All SPSS output should be included in the body of 
your work at the relevant point, and you should include only what is relevant 
to the question and your answer. 


(a) The file protein.sav contains measurements of protein consumption in 
25 European countries for nine food groups. There are 11 variables in 
the data set. Variable country identifies the country for the 
measurements, while area identifies the geographic area in Europe of 
the country, taking possible values 1, ..., 5, where 1 = Central,
2 = Northern, 3 = Eastern, 4 = Southern and 5 = Western. The
remaining nine variables contain the measurements of protein 
consumption, in grams per person per day, for nine food groups: 
redmeat, whitemeat, eggs, milk, fish, cereals, starchyfoods, 
nuts, fruitandveg. 


(i) Obtain a scatterplot of eggs against redmeat, on which it is 
possible to label a point by which country it represents (but do not 
include this plot in your answer). Briefly describe the relationship 
between the two variables. On the plot, label two countries that 
appear to be unusual or extreme, and describe in what way they 
are notable. Include a copy of the annotated scatterplot in your 
answer. 


(ii) What is the correlation coefficient between the variables redmeat 
and eggs? How does this correlation coefficient relate to your 
conclusion in part (a)(i) regarding the relationships between the 
variables?

(iii) Obtain a scatterplot of eggs against whitemeat, where the groups
corresponding to area are identified (but do not include this plot in 
your answer). Edit the scatterplot so that countries in different 
areas are represented using black plotting symbols of different 
shapes. Include a copy of this final scatterplot in your answer. 
Describe any differences that you observe between the areas of 
Europe. [4]








(iv) Obtain a profile plot of the six countries Denmark, Greece, Italy, 
Norway, Spain and Sweden (but do not include this plot in your 
answer). Edit the profile plot so that the profiles for Denmark, 
Norway and Sweden are all represented by solid black lines, while 
Greece, Italy and Spain have other black line types (they need not 
be the same) that can be distinguished easily in print. Include a 
copy of this edited profile plot in your answer. Which two variables 
vary the most across these countries? Which variable varies the 
least? What does the profile plot tell us about sources of protein in 
Denmark, Norway and Sweden? [7] 





(b) Thirty-six fish were cooked by three methods, and several judges tasted 
fish samples, rating each one in terms of aroma, flavour, texture and 
moisture. The file fishscores.sav contains data for the 36 fish. The 
variable id is the identifying number of each fish. The four variables 
aroma, flavour, texture and moisture are the average judges’ scores 
for aroma, flavour, texture and moisture, respectively, for each fish. 





(i) Obtain the correlation matrix of the variables aroma, flavour, 
texture and moisture, and give the lower triangle of this matrix 
(only) in your answer. Which pair of variables has the strongest 
linear relationship, and which the weakest? [3] 


(ii) Obtain the means and standard deviations of the variables aroma, 
flavour, texture and moisture, and provide a table of them in 
your answer. Briefly explain why these variables do not need to be 
standardized in order to be compared. [2] 


(iii) Obtain a matrix scatterplot of the variables aroma, flavour, 
texture and moisture (but do not include this plot in your 
answer). Identify one fish that has a low score for all four variables. 
Obtain a matrix scatterplot that labels the identification number of 
this fish on all of the plots. Include a copy of the labelled matrix 
scatterplot in your answer. [3] 


(iv) Using the matrix scatterplot, discuss what the correlations show in 
relation to your answer to part (b)(i). [2] 
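
As a cross-check on the SPSS correlation matrix in part (b)(i), the lower triangle can also be computed directly. The following is a minimal Python sketch, not part of the module materials: the function names are illustrative, and the scores are hypothetical values invented for the example, not data from fishscores.sav.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(data):
    """Return {(row_var, col_var): r} for each pair below the diagonal."""
    names = list(data)
    return {
        (names[i], names[j]): pearson(data[names[i]], data[names[j]])
        for i in range(len(names))
        for j in range(i)
    }


# HYPOTHETICAL scores for five fish; NOT taken from fishscores.sav.
scores = {
    "aroma":    [5.0, 5.5, 6.0, 4.5, 5.8],
    "flavour":  [5.2, 5.6, 6.1, 4.8, 5.9],
    "texture":  [6.0, 5.1, 5.9, 5.5, 6.2],
    "moisture": [6.3, 5.0, 6.4, 5.9, 6.1],
}

tri = lower_triangle(scores)
# Strength of a linear relationship is judged by the absolute value of r.
strongest = max(tri, key=lambda pair: abs(tri[pair]))
weakest = min(tri, key=lambda pair: abs(tri[pair]))

for (row, col), r in tri.items():
    print(f"{row:>8} vs {col:<8} r = {r:+.3f}")
print("strongest:", strongest, "weakest:", weakest)
```

With four variables the lower triangle holds six correlations, so the printed list should have six rows.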

Question 2 — 33 marks


This question is intended to assess your use and interpretation of principal
component analysis applied to a multivariate data set. You should be able to
answer this question after working through Part II of Book 3.


In this question you will be required to supply SPSS output for parts (a)(i),
(a)(iv) and (a)(v) only, though you will be expected to use SPSS to answer
the rest of the question. In part (a)(v) you will need to edit the default SPSS
plot, and you should include only the edited plot in your answer. All SPSS
output should be included in the body of your work at the relevant point, and
you should include only what is relevant to the question and your answer.


(a) The file apples.sav contains data collected from an experiment to assess 
the effect of four nitrogen treatments on Jonathan apples. For each of 
42 apple trees, the following variables were measured: total nitrogen 
(TN), protein nitrogen (PN), phosphorus (PH), potassium (PO), calcium 
(CA), magnesium (MA), mean fruit weight (grams) (FW), incidence of
bitter pit (%) (BP). Except for the last two, all measurements are in
parts per million. The variable id contains the identification number 
(from 1 to 42) for each tree, and the variable treatment identifies the 
treatment (labelled 1, 2, 3 or 4) applied to the tree. 


(i) Use SPSS to obtain the means and standard deviations of all
variables (except id and treatment). Include a copy of the table
produced by SPSS in your answer. Explain why it is sensible for a
principal component analysis using these variables to be conducted
only after standardizing the data.


(ii) For the apple with id = 2, the magnesium concentration is
367 parts per million. Using your answer to part (a)(i), calculate
the standardized value of magnesium for this apple without using
SPSS. Show your working.


(iii) Carry out a principal component analysis using the eight variables
(i.e. all except id and treatment) after they have been
standardized (but do not include the output from this analysis in
your answer). Write down the variances of the components, and
calculate the total variance. Why is the value of the total variance
not a surprise?


(iv) Obtain the scree plot, and include a copy of it in your answer. How
many components should be retained on the basis of the scree plot?
How many should be retained on the basis of Kaiser’s criterion?
Explain your answers.


(v) Obtain a scatterplot of the first two principal components, with the
groups corresponding to treatment identified (but do not include
this plot in your answer). Edit the scatterplot so that trees with
different treatments are represented using black plotting symbols of
different shapes. Include a copy of this edited scatterplot in your
answer. Briefly discuss how well each of the first two principal
components separates the trees with the different treatments, and
how well they do together.
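
The hand calculations in parts (a)(ii) to (a)(iv) follow a simple pattern, sketched below in Python. This is an illustration only, under stated assumptions: the function names are made up for the example, the mean and standard deviation are hypothetical placeholders (the real ones come from your SPSS table in part (a)(i)), and the component variances are invented. Kaiser's criterion is taken here to mean retaining components whose variance exceeds 1.

```python
def standardize(value, mean, sd):
    """Standardized value: distance from the mean in standard deviations."""
    return (value - mean) / sd


def kaiser_count(variances):
    """Number of components whose variance exceeds 1."""
    return sum(1 for v in variances if v > 1)


# 367 ppm is given in the question. The mean and sd below are HYPOTHETICAL
# placeholders; replace them with the values from the SPSS table.
hypothetical_mean = 350.0
hypothetical_sd = 40.0
z = standardize(367.0, hypothetical_mean, hypothetical_sd)
print(f"standardized magnesium (hypothetical mean and sd): {z:.3f}")

# HYPOTHETICAL component variances for eight standardized variables.
variances = [3.2, 1.9, 1.1, 0.8, 0.5, 0.3, 0.15, 0.05]
total = sum(variances)
print(f"total variance: {total:.2f}")
print("components retained by Kaiser's criterion:", kaiser_count(variances))
```

The same two functions can be reused with the real SPSS figures to check arithmetic before writing up the working.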

(b) This part of the question relates to the data on the fish scores described
in Question 1(b). The data are in the file fishscores.sav. 


(i) Carry out a principal component analysis using the variables 
aroma, flavour, texture and moisture that have not been 
standardized (but do not include the output from this analysis in 
your answer). Using a suitable table, write down the variances for 
all four components, and calculate the total variance. Extend your 
table to include the percentage variance explained (PVE) and the 
cumulative percentage explained (CPVE). Do not just paste the 
SPSS printout. 


(ii) How many components should be retained if it is important to
explain at least 80% of the variation? Explain your answer.


(iii) Write down the loadings produced by SPSS for the first two 
principal components. Calculate the loadings for the first principal 
component, using the constraint that their squares sum to 1. 


(iv) Using the constrained loadings that you calculated in part (b)(iii),
calculate by hand the value of the first principal component for fish 
number 1 — a fish with scores 5.4, 6, 6.3 and 6.7 for aroma, flavour, 
texture and moisture, respectively. Show your working. In this 
calculation, any values that you substitute in need be correct to 
only three decimal places. 


(v) Interpret the first two principal components using the SPSS 
loadings. 
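
The arithmetic behind parts (b)(i) to (b)(iv) can be sketched in Python as follows. This is a hedged illustration rather than a prescribed method: all function names are invented for the example, and the component variances and SPSS loadings are hypothetical, not output from fishscores.sav. Only the four scores for fish number 1 come from the question. Whether variable values are centred before the weighted sum is taken is a convention not stated here, so the sketch uses the raw scores; follow the definition in Book 3.

```python
import math


def pve_table(variances):
    """Rows of (component, variance, PVE %, CPVE %)."""
    total = sum(variances)
    rows, cumulative = [], 0.0
    for k, v in enumerate(variances, start=1):
        pve = 100 * v / total
        cumulative += pve
        rows.append((k, v, pve, cumulative))
    return rows


def components_needed(variances, target_pct):
    """Smallest number of leading components whose CPVE reaches target_pct."""
    for k, _, _, cpve in pve_table(variances):
        if cpve >= target_pct:
            return k
    return len(variances)


def unit_length(loadings):
    """Rescale loadings so that their squares sum to 1."""
    norm = math.sqrt(sum(a * a for a in loadings))
    return [a / norm for a in loadings]


def linear_combination(loadings, values):
    """Weighted sum of the variable values (no centring applied)."""
    return sum(a * x for a, x in zip(loadings, values))


# HYPOTHETICAL component variances and SPSS loadings (not real output).
variances = [0.60, 0.25, 0.10, 0.05]
spss_loadings_pc1 = [0.30, 0.40, 0.50, 0.45]

for k, v, pve, cpve in pve_table(variances):
    print(f"PC{k}: variance {v:.2f}  PVE {pve:5.1f}%  CPVE {cpve:5.1f}%")
print("components needed for 80%:", components_needed(variances, 80))

constrained = unit_length(spss_loadings_pc1)
fish1 = [5.4, 6.0, 6.3, 6.7]  # aroma, flavour, texture, moisture for fish 1
print("constrained loadings:", [round(a, 3) for a in constrained])
print("weighted sum for fish 1:", round(linear_combination(constrained, fish1), 3))
```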


Question 3 — 39 marks 


This question is intended to assess your use and interpretation of canonical
discriminant analysis, including application of allocation rules. You should
be able to answer this question after working through Part III of Book 3.


In this question, you will be required to supply SPSS output for parts (a)(iv)
and (b)(iii) only, though you will be expected to use SPSS to answer the rest
of the question. In part (b)(iii), you will need to edit the default SPSS plot,
and you should include only the edited plot in your answer. All SPSS output 
should be included in the body of your work at the relevant point, and you 
should include only what is relevant to the question and your answer. 


(a) The file thyroid.sav contains data from five laboratory tests used to
predict whether a patient’s thyroid function can be classified as normal,
hypothyroidism or hyperthyroidism, together with the patient’s
diagnosis based on their complete medical record. For each patient the
following five variables were measured:


TUT T3-resin uptake test (a percentage), 


STH total serum thyroxine as measured by the isotopic displacement 
method, 


STR total serum triiodothyronine as measured by radioimmunoassay, 


TSH basal thyroid-stimulating hormone as measured by 
radioimmunoassay, 


MAD maximal absolute difference of TSH value after injection of 200 
micrograms of thyrotropin-releasing hormone as compared to the 
basal value. 

The variable id identifies each patient (from 1 to 215), and the variable
class contains the patient’s diagnosis (1 = normal, 2 = hypo,
3 = hyper).


(i) What is the maximum number of useful discriminant functions that
can be obtained to separate the three classes of patient diagnosis?
Would this number increase if the number of variables measured on
each patient were increased? Briefly justify your answer.


(ii) Obtain and report the mean and variance of the variable TSH
within each group of patients defined by their diagnosis as
identified by variable class. Calculate the between-groups variance
and the within-groups variance, and hence obtain the separation
achieved by the variable TSH.


(iii) Carry out a discriminant analysis of the thyroid data based on the
three groups of patient diagnosis defined by variable class (but do
not include the output in your answer). Report the loadings for the
discriminant functions based on standardized data. Write down the
second discriminant function, and interpret it.


(iv) Obtain a stacked histogram of the values of the first discriminant
function obtained for the three groups defined by the variable
class. Include a copy of the histogram in your answer, and
comment on how well the groups are separated.


(v) Calculate the allocation rule for the three groups defined by the
variable class based on the first discriminant function. (Assume
equal costs and equal prior probabilities, and assume that the
values of the discriminant function are normally distributed with
equal variance in the three groups.) Hence obtain and report the
confusion matrix for this allocation rule, keeping in mind the
original class attributes (1 = normal, 2 = hypo, 3 = hyper).
Calculate the misclassification rate, and briefly comment on the
accuracy of the allocation rule.
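
The calculations in parts (a)(ii) and (a)(v) can be sketched in Python as a cross-check on hand working. This is a minimal illustration under stated assumptions, not the module's prescribed method. The separation function assumes the usual one-way definitions (between-groups divisor G - 1, within-groups divisor n - G); check these against Book 3. With equal costs, equal priors and normal distributions of equal variance, allocating to the group with the nearest mean is equivalent to using cut-offs midway between adjacent group means. All names and all numbers below are hypothetical, not taken from thyroid.sav.

```python
def separation(groups):
    """Between-groups variance divided by within-groups variance.

    ASSUMPTION: divisors G - 1 (between) and n - G (within).
    """
    n = sum(len(g) for g in groups)
    G = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (G - 1)
    within = sum(sum((x - m) ** 2 for x in g)
                 for g, m in zip(groups, means)) / (n - G)
    return between / within


def allocate(value, group_means):
    """Allocate a value to the group whose mean is nearest."""
    return min(group_means, key=lambda label: abs(value - group_means[label]))


def confusion_matrix(true_labels, values, group_means):
    """Counts indexed as counts[true class][allocated class]."""
    labels = sorted(group_means)
    counts = {t: {p: 0 for p in labels} for t in labels}
    for t, v in zip(true_labels, values):
        counts[t][allocate(v, group_means)] += 1
    return counts


def misclassification_rate(counts):
    """Proportion of cases allocated to a class other than their own."""
    total = sum(sum(row.values()) for row in counts.values())
    wrong = sum(c for t, row in counts.items()
                for p, c in row.items() if p != t)
    return wrong / total


# HYPOTHETICAL measurements in three groups (not from thyroid.sav).
hypothetical_groups = [[1.0, 1.4, 0.9, 1.2], [9.5, 14.0, 11.2], [0.8, 1.1, 0.7]]
print(f"separation: {separation(hypothetical_groups):.3f}")

# HYPOTHETICAL group means of a discriminant function, keyed by class label.
group_means = {1: 0.0, 2: -3.0, 3: 2.0}
true_labels = [1, 1, 1, 2, 2, 3, 3, 3]
values = [0.2, -0.4, 1.3, -2.8, -1.0, 2.5, 1.9, 0.7]

cm = confusion_matrix(true_labels, values, group_means)
for t in sorted(cm):
    print(f"true class {t}: {cm[t]}")
print(f"misclassification rate: {misclassification_rate(cm):.3f}")
```

Keying the group means by the original class labels keeps the rows and columns of the confusion matrix in the order 1 = normal, 2 = hypo, 3 = hyper, whatever the ordering of the means along the discriminant function.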


(b) This part of the question relates to the nitrogen treatments on Jonathan
apples described in Question 2(a). The data are in apples.sav.


(i) Carry out a discriminant analysis using standardized data of all
variables (except id and treatment) with treatment as the
grouping variable (but do not include the output from this analysis
in your answer). Write down the separation and percentage
separation achieved by each discriminant function. On the basis of
the percentage of total separation available, discuss how many
discriminant functions are needed to separate the groups.


(ii) Obtain the loadings for the discriminant functions based on
standardized data. Write down the first discriminant function, and
interpret it.


(iii) Obtain a scatterplot of the first two discriminant functions, with
the groups corresponding to treatment identified (but do not 
include the plot in your answer). Edit the scatterplot so that the 
trees with different treatments are represented using black plotting 
symbols of different shapes. Include a copy of this edited 
scatterplot in your answer. Identify two sets of pairs of treatments 
that separate well in the scatterplot. Explain your answers fully. In 
your view, are the trees for the four treatments well separated by 
the first two discriminant functions? 
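
For part (b)(i), the percentage separation of each discriminant function is its separation expressed as a share of the total. A short Python sketch follows; the function name is illustrative and the separations are hypothetical values chosen for the example, not results from apples.sav.

```python
def percentage_separation(separations):
    """Each separation as a percentage of the total separation."""
    total = sum(separations)
    return [100 * s / total for s in separations]


# HYPOTHETICAL separations for three discriminant functions.
separations = [4.0, 0.8, 0.2]
percentages = percentage_separation(separations)

cumulative = 0.0
for k, (s, pct) in enumerate(zip(separations, percentages), start=1):
    cumulative += pct
    print(f"function {k}: separation {s:.2f}  "
          f"percentage {pct:5.1f}%  cumulative {cumulative:5.1f}%")
```

The cumulative column is a convenient basis for arguing how many functions are needed.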